Based on the results above, color, carat, and clarity are the three variables that best predict trends in diamond prices, so I use them to build the final model for price. However, three variables quantify carat, so I first perform variable selection to determine which set of carat variables best predicts price. Before fitting, I removed extreme data points: the diamonds whose prices are in the top 2% of all prices. These may represent specialized markets, so I did not want to include them. This step removed 1079 of 53940 diamonds. I then generated a training set from 85% of the data and a testing set from the remaining 15%.
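The trimming and splitting steps described above can be sketched in base R as follows. This is a minimal illustration on synthetic prices, not the code used for this report: the data frame `toy`, the seed, and the names `toy_train` and `toy_test` are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Synthetic stand-in for the price column of the diamonds data
toy <- data.frame(price = round(exp(rnorm(1000, mean = 7.8, sd = 1))))

# Drop the observations whose price lies above the 98th percentile
cutoff <- quantile(toy$price, probs = 0.98)
trimmed <- toy[toy$price <= cutoff, , drop = FALSE]

# Random 85% / 15% split into training and testing sets
n <- nrow(trimmed)
train_idx <- sample(n, size = floor(0.85 * n))
toy_train <- trimmed[train_idx, , drop = FALSE]
toy_test <- trimmed[-train_idx, , drop = FALSE]

cat("removed:", nrow(toy) - n, "train:", nrow(toy_train), "test:", nrow(toy_test), "\n")
```

Splitting after trimming keeps the extreme prices out of both sets, so the test set is drawn from the same price range the model was trained on.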
Choosing the best carat variable.
The first task was to find the best variable for representing carat. For this, I fit log(price) against different combinations of carat variables. The models were:
- m1: log(price) vs carat,
- m2: log(price) vs carat.cut,
- m3: log(price) vs carat.cut + DistPrefered
The results indicate that the carat.cut + DistPrefered model explains a greater share of the variance in log(price) than carat alone (R-squared of 0.935 vs 0.847); carat.cut on its own already reaches 0.933. I chose carat.cut + DistPrefered to model carat information because this combination captures two important aspects of carat. First, it accounts for the diamond's preferred size, and second, it accounts for deviations above the preferred size.
##
## Calls:
## m1: lm(formula = I(log(price)) ~ carat, data = diamonds)
## m2: lm(formula = I(log(price)) ~ carat.cut, data = diamonds)
## m3: lm(formula = I(log(price)) ~ carat.cut + DistPrefered, data = diamonds)
##
## ======================================================================
##                                   m1           m2           m3
## ----------------------------------------------------------------------
## (Intercept)                       6.215***     6.271***     6.137***
##                                  (0.003)      (0.007)      (0.008)
## carat                             1.970***
##                                  (0.004)
## carat.cut: (0.29,0.39]/(0,0.29]                0.308***     0.423***
##                                               (0.007)      (0.008)
## carat.cut: (0.39,0.49]/(0,0.29]                0.607***     0.725***
##                                               (0.008)      (0.008)
## carat.cut: (0.49,0.69]/(0,0.29]                1.147***     1.250***
##                                               (0.007)      (0.008)
## carat.cut: (0.69,0.89]/(0,0.29]                1.622***     1.730***
##                                               (0.007)      (0.008)
## carat.cut: (0.89,0.99]/(0,0.29]                1.991***     2.112***
##                                               (0.008)      (0.009)
## carat.cut: (0.99,1.19]/(0,0.29]                2.328***     2.430***
##                                               (0.007)      (0.008)
## carat.cut: (1.19,1.49]/(0,0.29]                2.574***     2.667***
##                                               (0.008)      (0.008)
## carat.cut: (1.49,1.69]/(0,0.29]                2.957***     3.065***
##                                               (0.008)      (0.009)
## carat.cut: (1.69,1.99]/(0,0.29]                3.113***     3.213***
##                                               (0.012)      (0.012)
## carat.cut: (1.99,6]/(0,0.29]                   3.313***     3.364***
##                                               (0.009)      (0.009)
## DistPrefered                                                0.578***
##                                                            (0.018)
## ----------------------------------------------------------------------
## R-squared                         0.847        0.933        0.935
## adj. R-squared                    0.847        0.933        0.935
## sigma                             0.397        0.262        0.259
## F                                 298093.428   75679.438    70166.882
## p                                 0.000        0.000        0.000
## Log-likelihood                   -26726.512   -4223.837    -3727.590
## Deviance                          8507.660     3693.565     3626.223
## AIC                               53459.025    8471.675     7481.181
## BIC                               53485.712    8578.422     7596.824
## N                                 53939        53939        53939
## ======================================================================

Fig 33. Results of fitted values for different carat variable combinations
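The comparison in this subsection can be sketched as below, on synthetic data rather than the real diamonds. The bin edges are read off the carat.cut labels in the table. The definition of DistPrefered used here (carat minus the lower edge of its bin) is an assumption made for illustration, since the variable is defined elsewhere in the report; the data frame `toy` and the simulated price relationship are also hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for the diamonds data (not the real data set)
n <- 2000
toy <- data.frame(carat = round(runif(n, min = 0.2, max = 3), 2))
toy$price <- exp(8.4 + 1.7 * log(toy$carat) + rnorm(n, sd = 0.25))

# Bin edges taken from the carat.cut labels in the regression table
breaks <- c(0, 0.29, 0.39, 0.49, 0.69, 0.89, 0.99, 1.19, 1.49, 1.69, 1.99, 6)
toy$carat.cut <- cut(toy$carat, breaks = breaks)

# ASSUMPTION: DistPrefered is the distance of carat above the lower edge of its bin
toy$DistPrefered <- toy$carat - breaks[as.integer(toy$carat.cut)]

# The three candidate models for the carat information
m1 <- lm(log(price) ~ carat, data = toy)
m2 <- lm(log(price) ~ carat.cut, data = toy)
m3 <- lm(log(price) ~ carat.cut + DistPrefered, data = toy)

# Compare explained variance and AIC across the candidates
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3), function(m) summary(m)$r.squared)
print(round(r2, 3))
print(AIC(m1, m2, m3))
```

Because m3 nests m2, its R-squared can never be lower than that of m2, so AIC or BIC is the more informative check on whether the extra term earns its place.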
Final model to predict diamond price.
Finally, I add clarity and color to carat.cut + DistPrefered to build a linear regression model that predicts diamond price. With clarity and color included, the model explains more than 98% of the variance in log(price) (R-squared of 0.981 for m4, and 0.983 for m5, which adds the interaction terms). This is also confirmed by plots of log(price) against predicted values.
##
## Calls:
## m1: lm(formula = I(log(price)) ~ carat.cut, data = diamonds_train)
## m2: lm(formula = I(log(price)) ~ carat.cut + DistPrefered, data = diamonds_train)
## m3: lm(formula = I(log(price)) ~ carat.cut + DistPrefered + clarity,
## data = diamonds_train)
## m4: lm(formula = I(log(price)) ~ carat.cut + DistPrefered + clarity +
## color, data = diamonds_train)
## m5: lm(formula = I(log(price)) ~ carat.cut + DistPrefered + clarity +
## color + clarity:color + carat.cut:color, data = diamonds_train)
##
## ===================================================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------------------------------
## (Intercept) 6.270*** 6.132*** 5.890*** 5.753*** 5.807***
## (0.007) (0.008) (0.006) (0.005) (0.009)
## carat.cut: (0.29,0.39]/(0,0.29] 0.309*** 0.427*** 0.568*** 0.634*** 0.589***
## (0.008) (0.008) (0.006) (0.005) (0.009)
## carat.cut: (0.39,0.49]/(0,0.29] 0.607*** 0.729*** 0.901*** 0.961*** 0.936***
## (0.008) (0.009) (0.007) (0.005) (0.009)
## carat.cut: (0.49,0.69]/(0,0.29] 1.148*** 1.255*** 1.454*** 1.502*** 1.454***
## (0.008) (0.009) (0.006) (0.005) (0.009)
## carat.cut: (0.69,0.89]/(0,0.29] 1.623*** 1.734*** 1.997*** 2.084*** 2.042***
## (0.008) (0.009) (0.006) (0.005) (0.009)
## carat.cut: (0.89,0.99]/(0,0.29] 1.990*** 2.115*** 2.450*** 2.554*** 2.513***
## (0.009) (0.010) (0.007) (0.005) (0.010)
## carat.cut: (0.99,1.19]/(0,0.29] 2.328*** 2.433*** 2.727*** 2.824*** 2.767***
## (0.008) (0.008) (0.006) (0.005) (0.009)
## carat.cut: (1.19,1.49]/(0,0.29] 2.578*** 2.673*** 2.937*** 3.101*** 3.045***
## (0.009) (0.009) (0.007) (0.005) (0.009)
## carat.cut: (1.49,1.69]/(0,0.29] 2.960*** 3.071*** 3.358*** 3.511*** 3.462***
## (0.009) (0.009) (0.007) (0.005) (0.009)
## carat.cut: (1.69,1.99]/(0,0.29] 3.113*** 3.216*** 3.520*** 3.705*** 3.658***
## (0.013) (0.013) (0.009) (0.007) (0.011)
## carat.cut: (1.99,6]/(0,0.29] 3.315*** 3.368*** 3.760*** 3.987*** 3.921***
## (0.009) (0.009) (0.007) (0.005) (0.010)
## DistPrefered 0.596*** 0.783*** 0.909*** 0.882***
## (0.020) (0.014) (0.011) (0.010)
## clarity: .L 0.863*** 0.899*** 0.878***
## (0.005) (0.004) (0.005)
## clarity: .Q -0.238*** -0.236*** -0.219***
## (0.005) (0.004) (0.004)
## clarity: .C 0.148*** 0.149*** 0.169***
## (0.004) (0.003) (0.004)
## clarity: ^4 -0.124*** -0.110*** -0.102***
## (0.003) (0.002) (0.003)
## clarity: ^5 0.178*** 0.193*** 0.196***
## (0.003) (0.002) (0.002)
## clarity: ^6 -0.076*** -0.086*** -0.071***
## (0.003) (0.002) (0.002)
## clarity: ^7 0.208*** 0.240*** 0.231***
## (0.002) (0.002) (0.002)
## color: .L 0.432*** 0.200***
## (0.002) (0.033)
## color: .Q -0.090*** -0.057
## (0.002) (0.031)
## color: .C 0.012*** 0.012
## (0.002) (0.025)
## color: ^4 0.010*** -0.015
## (0.002) (0.019)
## color: ^5 0.007*** 0.023
## (0.002) (0.013)
## color: ^6 0.001 0.012
## (0.002) (0.010)
## clarity: .L x color: .L 0.405***
## (0.015)
## clarity: .Q x color: .L 0.151***
## (0.014)
## clarity: .C x color: .L -0.062***
## (0.013)
## clarity: ^4 x color: .L 0.094***
## (0.009)
## clarity: ^5 x color: .L 0.059***
## (0.008)
## clarity: ^6 x color: .L -0.017*
## (0.008)
## clarity: ^7 x color: .L 0.128***
## (0.006)
## clarity: .L x color: .Q -0.005
## (0.014)
## clarity: .Q x color: .Q 0.110***
## (0.013)
## clarity: .C x color: .Q 0.039***
## (0.012)
## clarity: ^4 x color: .Q 0.031***
## (0.009)
## clarity: ^5 x color: .Q 0.022**
## (0.008)
## clarity: ^6 x color: .Q 0.063***
## (0.008)
## clarity: ^7 x color: .Q -0.011
## (0.006)
## clarity: .L x color: .C 0.063***
## (0.013)
## clarity: .Q x color: .C 0.087***
## (0.012)
## clarity: .C x color: .C 0.060***
## (0.011)
## clarity: ^4 x color: .C 0.036***
## (0.008)
## clarity: ^5 x color: .C 0.037***
## (0.007)
## clarity: ^6 x color: .C 0.020**
## (0.007)
## clarity: ^7 x color: .C 0.002
## (0.006)
## clarity: .L x color: ^4 0.098***
## (0.011)
## clarity: .Q x color: ^4 -0.004
## (0.011)
## clarity: .C x color: ^4 0.021*
## (0.009)
## clarity: ^4 x color: ^4 -0.005
## (0.007)
## clarity: ^5 x color: ^4 0.026***
## (0.006)
## clarity: ^6 x color: ^4 0.017**
## (0.006)
## clarity: ^7 x color: ^4 0.024***
## (0.005)
## clarity: .L x color: ^5 0.028**
## (0.010)
## clarity: .Q x color: ^5 0.006
## (0.009)
## clarity: .C x color: ^5 -0.017*
## (0.008)
## clarity: ^4 x color: ^5 0.008
## (0.006)
## clarity: ^5 x color: ^5 0.003
## (0.005)
## clarity: ^6 x color: ^5 -0.008
## (0.005)
## clarity: ^7 x color: ^5 0.007
## (0.004)
## clarity: .L x color: ^6 -0.023**
## (0.009)
## clarity: .Q x color: ^6 0.031***
## (0.008)
## clarity: .C x color: ^6 -0.021**
## (0.007)
## clarity: ^4 x color: ^6 -0.003
## (0.005)
## clarity: ^5 x color: ^6 0.001
## (0.005)
## clarity: ^6 x color: ^6 0.016***
## (0.004)
## clarity: ^7 x color: ^6 0.001
## (0.004)
## carat.cut: (0.29,0.39]/(0,0.29] x color: .L 0.300***
## (0.033)
## carat.cut: (0.39,0.49]/(0,0.29] x color: .L 0.177***
## (0.034)
## carat.cut: (0.49,0.69]/(0,0.29] x color: .L 0.300***
## (0.034)
## carat.cut: (0.69,0.89]/(0,0.29] x color: .L 0.301***
## (0.033)
## carat.cut: (0.89,0.99]/(0,0.29] x color: .L 0.295***
## (0.034)
## carat.cut: (0.99,1.19]/(0,0.29] x color: .L 0.312***
## (0.033)
## carat.cut: (1.19,1.49]/(0,0.29] x color: .L 0.356***
## (0.034)
## carat.cut: (1.49,1.69]/(0,0.29] x color: .L 0.381***
## (0.034)
## carat.cut: (1.69,1.99]/(0,0.29] x color: .L 0.360***
## (0.038)
## carat.cut: (1.99,6]/(0,0.29] x color: .L 0.320***
## (0.035)
## carat.cut: (0.29,0.39]/(0,0.29] x color: .Q 0.031
## (0.032)
## carat.cut: (0.39,0.49]/(0,0.29] x color: .Q 0.005
## (0.032)
## carat.cut: (0.49,0.69]/(0,0.29] x color: .Q 0.010
## (0.032)
## carat.cut: (0.69,0.89]/(0,0.29] x color: .Q 0.017
## (0.032)
## carat.cut: (0.89,0.99]/(0,0.29] x color: .Q 0.005
## (0.032)
## carat.cut: (0.99,1.19]/(0,0.29] x color: .Q -0.014
## (0.032)
## carat.cut: (1.19,1.49]/(0,0.29] x color: .Q -0.031
## (0.032)
## carat.cut: (1.49,1.69]/(0,0.29] x color: .Q -0.055
## (0.032)
## carat.cut: (1.69,1.99]/(0,0.29] x color: .Q -0.070
## (0.036)
## carat.cut: (1.99,6]/(0,0.29] x color: .Q -0.045
## (0.034)
## carat.cut: (0.29,0.39]/(0,0.29] x color: .C 0.016
## (0.026)
## carat.cut: (0.39,0.49]/(0,0.29] x color: .C 0.040
## (0.026)
## carat.cut: (0.49,0.69]/(0,0.29] x color: .C 0.021
## (0.026)
## carat.cut: (0.69,0.89]/(0,0.29] x color: .C 0.035
## (0.026)
## carat.cut: (0.89,0.99]/(0,0.29] x color: .C -0.010
## (0.026)
## carat.cut: (0.99,1.19]/(0,0.29] x color: .C 0.009
## (0.026)
## carat.cut: (1.19,1.49]/(0,0.29] x color: .C 0.005
## (0.026)
## carat.cut: (1.49,1.69]/(0,0.29] x color: .C -0.002
## (0.026)
## carat.cut: (1.69,1.99]/(0,0.29] x color: .C -0.019
## (0.030)
## carat.cut: (1.99,6]/(0,0.29] x color: .C -0.017
## (0.028)
## carat.cut: (0.29,0.39]/(0,0.29] x color: ^4 0.025
## (0.019)
## carat.cut: (0.39,0.49]/(0,0.29] x color: ^4 0.017
## (0.020)
## carat.cut: (0.49,0.69]/(0,0.29] x color: ^4 0.046*
## (0.020)
## carat.cut: (0.69,0.89]/(0,0.29] x color: ^4 0.018
## (0.019)
## carat.cut: (0.89,0.99]/(0,0.29] x color: ^4 0.046*
## (0.020)
## carat.cut: (0.99,1.19]/(0,0.29] x color: ^4 0.049*
## (0.019)
## carat.cut: (1.19,1.49]/(0,0.29] x color: ^4 0.047*
## (0.020)
## carat.cut: (1.49,1.69]/(0,0.29] x color: ^4 0.032
## (0.020)
## carat.cut: (1.69,1.99]/(0,0.29] x color: ^4 -0.006
## (0.024)
## carat.cut: (1.99,6]/(0,0.29] x color: ^4 -0.000
## (0.022)
## carat.cut: (0.29,0.39]/(0,0.29] x color: ^5 -0.019
## (0.014)
## carat.cut: (0.39,0.49]/(0,0.29] x color: ^5 -0.031*
## (0.015)
## carat.cut: (0.49,0.69]/(0,0.29] x color: ^5 -0.014
## (0.014)
## carat.cut: (0.69,0.89]/(0,0.29] x color: ^5 -0.017
## (0.014)
## carat.cut: (0.89,0.99]/(0,0.29] x color: ^5 -0.019
## (0.015)
## carat.cut: (0.99,1.19]/(0,0.29] x color: ^5 -0.012
## (0.014)
## carat.cut: (1.19,1.49]/(0,0.29] x color: ^5 -0.012
## (0.015)
## carat.cut: (1.49,1.69]/(0,0.29] x color: ^5 0.020
## (0.015)
## carat.cut: (1.69,1.99]/(0,0.29] x color: ^5 -0.011
## (0.020)
## carat.cut: (1.99,6]/(0,0.29] x color: ^5 0.015
## (0.017)
## carat.cut: (0.29,0.39]/(0,0.29] x color: ^6 -0.003
## (0.011)
## carat.cut: (0.39,0.49]/(0,0.29] x color: ^6 0.014
## (0.012)
## carat.cut: (0.49,0.69]/(0,0.29] x color: ^6 -0.000
## (0.011)
## carat.cut: (0.69,0.89]/(0,0.29] x color: ^6 -0.008
## (0.011)
## carat.cut: (0.89,0.99]/(0,0.29] x color: ^6 0.013
## (0.012)
## carat.cut: (0.99,1.19]/(0,0.29] x color: ^6 -0.009
## (0.011)
## carat.cut: (1.19,1.49]/(0,0.29] x color: ^6 -0.002
## (0.012)
## carat.cut: (1.49,1.69]/(0,0.29] x color: ^6 -0.015
## (0.012)
## carat.cut: (1.69,1.99]/(0,0.29] x color: ^6 -0.040*
## (0.017)
## carat.cut: (1.99,6]/(0,0.29] x color: ^6 -0.009
## (0.014)
## ---------------------------------------------------------------------------------------------------
## R-squared 0.933 0.934 0.966 0.981 0.983
## adj. R-squared 0.933 0.934 0.966 0.981 0.983
## sigma 0.262 0.260 0.187 0.139 0.133
## F 63989.415 59370.592 72568.453 99366.319 20856.337
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -3716.796 -3279.825 11850.305 25298.553 27534.042
## Deviance 3156.899 3097.293 1600.820 890.360 807.634
## AIC 7457.593 6585.650 -23660.610 -50545.105 -54812.083
## BIC 7562.390 6699.180 -23485.949 -50318.045 -53694.248
## N 45848 45848 45848 45848 45848
## ===================================================================================================

Fig 34. Results of fitted values for adding variables incrementally.
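The incremental comparison above can be sketched on synthetic data as follows. The coefficient labels .L, .Q, .C, ^4 and so on in the table arise because R fits ordered factors with polynomial contrasts by default. Everything in this sketch is illustrative: the data frame `toy`, the level names and their ordering, and the simulated effects are assumptions, and DistPrefered is left out to keep the example short.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in data with ordered quality factors (levels are illustrative)
n <- 3000
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")
color_levels <- c("J", "I", "H", "G", "F", "E", "D")
toy <- data.frame(
  carat = round(runif(n, min = 0.2, max = 3), 2),
  clarity = factor(sample(clarity_levels, n, replace = TRUE),
                   levels = clarity_levels, ordered = TRUE),
  color = factor(sample(color_levels, n, replace = TRUE),
                 levels = color_levels, ordered = TRUE)
)
toy$log_price <- 8.4 + 1.7 * log(toy$carat) +
  0.10 * as.integer(toy$clarity) + 0.05 * as.integer(toy$color) +
  rnorm(n, sd = 0.1)

breaks <- c(0, 0.29, 0.39, 0.49, 0.69, 0.89, 0.99, 1.19, 1.49, 1.69, 1.99, 6)
toy$carat.cut <- cut(toy$carat, breaks = breaks)

# Add one term at a time, as in the table above
m_carat <- lm(log_price ~ carat.cut, data = toy)
m_clarity <- update(m_carat, . ~ . + clarity)
m_color <- update(m_clarity, . ~ . + color)
m_inter <- update(m_color, . ~ . + clarity:color)

fits <- list(carat = m_carat, clarity = m_clarity, color = m_color, inter = m_inter)
comparison <- data.frame(
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
  BIC = sapply(fits, BIC)
)
print(round(comparison, 3))

# Ordered factors get polynomial contrasts: clarity.L, clarity.Q, clarity.C, ...
print(names(coef(m_clarity))[grepl("^clarity", names(coef(m_clarity)))])
```

Comparing BIC alongside adjusted R-squared helps judge whether the many interaction terms improve the fit enough to justify the added complexity.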
I also compared the model's predictive performance on the training set and the testing set. The model appears to perform equally well on both, which suggests that it generalizes to data it was not fitted on.

Fig 35. Performance of model on training and testing set
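One way to put numbers on the train/test comparison is to compute the same error metrics on both sets. The sketch below does this on synthetic data; the data frame `toy`, the helper functions `rmse` and `r2`, and the simplified model formula are assumptions for illustration and are not the code behind the figure.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in data (not the real diamonds)
n <- 1000
toy <- data.frame(carat = round(runif(n, min = 0.2, max = 3), 2))
toy$log_price <- 8.4 + 1.7 * log(toy$carat) + rnorm(n, sd = 0.15)
breaks <- c(0, 0.29, 0.39, 0.49, 0.69, 0.89, 0.99, 1.19, 1.49, 1.69, 1.99, 6)
toy$carat.cut <- cut(toy$carat, breaks = breaks)

# 85% / 15% split, then fit on the training set only
train_idx <- sample(n, size = floor(0.85 * n))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]
fit <- lm(log_price ~ carat.cut, data = toy_train)

# Hypothetical helpers: root mean squared error and R-squared of predictions
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
r2 <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

pred_train <- predict(fit, newdata = toy_train)
pred_test <- predict(fit, newdata = toy_test)

perf <- data.frame(
  set = c("train", "test"),
  rmse = c(rmse(toy_train$log_price, pred_train), rmse(toy_test$log_price, pred_test)),
  r2 = c(r2(toy_train$log_price, pred_train), r2(toy_test$log_price, pred_test))
)
print(perf)
```

Similar RMSE and R-squared values on the two sets would support the conclusion that the model is not overfitting the training data.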